Dengue fever is a vector-borne infectious disease that is endemic in the tropical world. Singapore is one of several countries with a high dengue disease burden. In 2020, Singapore saw 1,158 dengue cases in a single week of June, the highest weekly count recorded since 2014. Why was there a sudden spike in dengue cases this year?
In our Capstone Project, we ask two questions about potential factors affecting dengue cases. Our questions are not confirmatory in nature; rather, they are exploratory.
Do atmospheric variables influence the incidence of dengue cases? Specifically, are higher humidity, precipitation, or temperature associated with increased numbers of dengue cases?
Do local factors explain the differing numbers of cases across regions of Singapore?
For our data science project, we activated the following packages, using the Tidyverse approach.
# Rule for activating packages in development mode:
# - Specify everything fully
# - Exceptions:
# - ggfortify - to overload `ggplot2::autoplot()`
# - ggmap - has to be activated for `ggmap::register_google()` to work
# - ggplot2 - some function arguments require `ggplot2::`, it's crazy
# - magrittr - `%>%`
#
# For deployment mode, activate every package that is referenced more than once
# - Add package name here, then use find-and-replace to remove "<package>::"
# - Allowed to activate tidyverse
# - Remember what are the 8 packages?
pacman::p_load(
ggfortify,
# ggmap,
ggplot2,
magrittr
# tidyverse
)
Then, we imported our dataset.
The most time-consuming component of data science? This might take a while…
We work towards 2 separate datasets: data_time for temporal analysis, and data_space for spatial analysis. They will only be fully available after the tidy and transform process since there’s quite a bit of aggregating and joining involved.
We start with 7 raw datasets.
The main sources are:
1. Data.gov.sg
2. MOH Weekly Infectious Disease Bulletin
3. MOH Healthcare Institutions Directory
4. MSS Historical Daily Weather Records
Click here for a summary page.
The full glorious details of their provenance are given below.
The following code chunk has been disabled, but is provided here to show the details behind the webscraping processes.
import_mss_daily <- function(years, stations = NULL) {
#' Historical Daily Weather Records
#'
#' @description
#' Daily weather records from the Meteorological Service Singapore (MSS).
#' Available data ranges from January 1980 to June 2020. Only 19 of the 63
#' climate stations are recognized by this function, because these contain
#' more than just rainfall data. Relevant links are given under details.
#'
#' MSS is nice enough to have their data in .csv files, with a systematic
#' naming scheme. Data compilation becomes as simple as generating the
#' appropriate list of URLs.
#'
#' @details
#' \href{http://www.weather.gov.sg/climate-historical-daily/}{MSS Daily Records}
#'
#' \href{http://www.weather.gov.sg/wp-content/uploads/2016/12/Station_Records.pdf}{List of stations, weather parameters and periods of records}
#'
#' @param years A vector of the years of interest.
#' @param stations A vector of the climate station names.
#' @return Combined daily weather records.
#' @examples
#' import_mss_daily(2012:2020, "Changi")
#' import_mss_daily(2012:2020, c("Changi", "Clementi", "Khatib", "Newton"))
stations_lookup = c(
"Admiralty" = "104_",
"Ang Mo Kio" = "109_",
"Changi" = "24_",
"Choa Chu Kang (South)" = "121_",
"Clementi" = "50_",
"East Coast Parkway" = "107_",
"Jurong (West)" = "44_",
"Jurong Island" = "117_",
"Khatib" = "122_",
"Marina Barrage" = "108_",
"Newton" = "111_",
"Pasir Panjang" = "116_",
"Pulau Ubin" = "106_",
"Seletar" = "25_",
"Sembawang" = "80_",
"Sentosa Island" = "60_",
"Tai Seng" = "43_",
"Tengah" = "23_",
"Tuas South" = "115_"
)
# Check that all provided station names are in the list, if not, exit and
# print out the list (of names) for the user.
mask = !(stations %in% names(stations_lookup))
if (any(mask)) {
stop("The following station names are not recognized:\n",
paste(stations[mask], collapse = "\n"),
"\n\nPlease select from the following:\n",
paste(names(stations_lookup), collapse = "\n"))
}
# If no station names specified, take the full list
if (is.null(stations)) {
station_nums = stations_lookup
} else {
station_nums = stations_lookup[stations]
}
# The URL to each .csv file has the following format:
# - "<url_base><station number>_<YYYY><MM>.csv"
# - Notice the station numbers given above contain the "_" suffix
url_base = "http://www.weather.gov.sg/files/dailydata/DAILYDATA_S"
# To enumerate all the URLs, take the Cartesian product of:
# - station numbers
# - years
# - months (we'll always attempt to find all 12 months)
#
# Flatten the result into a vector, then prefix and suffix with `url_base`
# and ".csv", respectively.
# Base R
urls = station_nums %>%
outer(years, FUN = paste0) %>%
outer(sprintf("%02d", 1:12), FUN = paste0) %>%
sort() %>% # Flatten and arrange in alphabetical order
paste0(url_base, ., ".csv")
# Equivalent tidyverse approach (for reference)
# urls = station_nums %>%
# tidyr::crossing(years, sprintf("%02d", 1:12)) %>%
# tidyr::unite("station_year_month", sep = "") %>%
# dplyr::pull() %>%
# paste0(url_base, ., ".csv")
# We have a list of URLs, but some of them may not exist (some stations do
# not have data for some months). Use tryCatch() to return a table for the
# URLs that exist, and a NA for those that don't.
#
# Problems with multibyte characters and countermeasures:
# - Some colnames contained the "degree sign" symbol which somehow made the
# table contents inaccessible. Tables after April 2020 had slightly
# different colnames.
# - Manually set the colnames for each table.
# - Some tables contained the "em dash" symbol to indicate missing values.
# - They will be coerced to NA when the columns are type cast to numeric.
# - There will be warning messages.
dfs = urls %>%
lapply(function(url) {
tryCatch({
df = readr::read_csv(url,
skip = 1,
col_names = c(
"Station",
"Year",
"Month",
"Day",
"Rainfall",
"Highest_30_min_rainfall",
"Highest_60_min_rainfall",
"Highest_120_min_rainfall",
"Mean_temp",
"Max_temp",
"Min_temp",
"Mean_wind",
"Max_wind"
),
col_types = readr::cols_only(
"Station" = "c",
"Year" = "n",
"Month" = "n",
"Day" = "n",
"Rainfall" = "n",
"Mean_temp" = "n",
"Max_temp" = "n",
"Min_temp" = "n"
))
# Announce progress (UX is important! We can tolerate lower efficiency)
message(paste(df[1, 1:3], collapse = "_"))
df
}, error = function(e) {
NA
})
})
dfs[!is.na(dfs)] %>%
dplyr::bind_rows() %>%
# Calculate daily temperature range
dplyr::mutate(Temp_range = Max_temp - Min_temp,
.keep = "unused") %>%
# Calculate epidemiological years and weeks
dplyr::mutate(Date = lubridate::ymd(paste(Year, Month, Day, sep = "-")),
Epiyear = lubridate::epiyear(Date),
Epiweek = lubridate::epiweek(Date),
.keep = "unused",
.after = Station) %>%
dplyr::arrange(Station, Date)
}
import_hcidirectory <- function() {
#' Healthcare Institutions Directory
#'
#' @description
#' The \href{http://hcidirectory.sg/hcidirectory/}{Healthcare Institutions
#' (HCI) Directory}, an initiative by the Ministry of Health (MOH), is a
#' platform for all HCIs licensed under the Private Hospitals and Medical
#' Clinics (PHMC) Act to provide information about their services and
#' operations to the public.
#'
#' This function is custom-made to consolidate the names and addresses of
#' HCIs which are medical clinics that offer general medical services.
#'
#' The HCI Directory is a dynamic web page, so using RSelenium might be
#' required.
#'
#' @return The names and addresses of selected HCIs.
# Run a Selenium Server using `RSelenium::rsDriver()`. The parameters e.g.
# `browser`, `chromever` (or `geckover` if using Firefox, or other drivers
# if using other browsers) have to be properly set. Trial-and-error until a
# configuration works. Set `check = T` the very first time it's run on a
# system, then set `check = F` after that to speed things up.
rD = RSelenium::rsDriver(browser = "chrome",
chromever = "83.0.4103.39",
check = F)
# Connect to server with a remoteDriver instance.
remDr = rD$client
# Set timeout on waiting for elements
remDr$setTimeout(type = "implicit", milliseconds = 10000)
# Navigate to the given URL
remDr$navigate("http://hcidirectory.sg/hcidirectory/")
# Click 4 things:
# 1. "MORE SEARCH OPTIONS"
# 2. "Medical Clinics Only"
# 3. "General Medical"
# 4. "Search"
c(
"options" = "#moreSearchOptions",
"medclins" = "#criteria > table > tbody > tr:nth-child(2) > td > label",
"genmed" = "#isGenMed",
"search" = "#search_btn_left"
) %>%
lapply(remDr$findElement, using = "css") %>%
purrr::walk(function(elem) elem$clickElement())
# Find the number of pages
results = remDr$findElement("#results", using = "css")
npages = results$getElementAttribute("innerHTML")[[1]] %>%
xml2::read_html() %>%
rvest::html_node("#totalPage") %>%
rvest::html_attr("value") %>%
as.numeric()
# Create an empty tibble to append results
df = tibble::tibble(
id = character(),
name = character(),
add = character()
)
i = 1
while (T) {
results = remDr$findElement("#results", using = "css")
html = results$getElementAttribute("innerHTML")[[1]] %>%
xml2::read_html()
# Determine the index numbers of the (up to 10) results on the page
idx = html %>%
# Find the element that says "SHOWING 1 - 10 OF 1,761 RESULTS"
rvest::html_nodes(".col1") %>%
.[1] %>%
rvest::html_text() %>%
# Commas have to be eliminated for numbers > 999
gsub(",", "", .) %>%
# Find the smallest and largest numbers and apply the colon operator
sub(".*Showing\\s+(.*)\\s+of.*", "\\1", .) %>%
strsplit(split = " - ") %>%
unlist() %>%
as.numeric() %>%
{ .[1]:.[2] }
# Only append results if IDs are not in the table
if (!any(idx %in% df$id)) {
df = df %>%
dplyr::bind_rows(
html %>%
# Find both the name and address nodes
rvest::html_nodes(".name,.add") %>%
rvest::html_text() %>%
# Tidy whitespace
gsub("\\s+", " ", .) %>%
trimws() %>%
# Concatenate IDs, odd rows (names), and even rows (addresses)
{ cbind(idx, .[c(TRUE,FALSE)], .[!c(TRUE,FALSE)]) } %>%
tibble::as_tibble() %>%
setNames(c("id", "name", "add"))
)
# Announce progress and increment page counter
message(i, " of ", npages, " done (", round(i / npages * 100, 2), "%)")
i = i + 1
}
# Natural exit point
if (i > npages) break
# Navigate to the next page (if available, else stop)
the_end = tryCatch({
nextpage = remDr$findElement("#PageControl > div.r_arrow", using = "css")
nextpage$clickElement()
F
}, error = function(e) {
print(paste("There are no more pages after", i))
T
})
# Unnatural exit point
if (the_end) break
}
# Clean up RSelenium
remDr$close()
rD[["server"]]$stop()
rm(rD, remDr)
gc()
# Kill Java instance(s) inside RStudio
# docs.microsoft.com/en-us/windows-server/administration/windows-commands/taskkill
system("taskkill /im java.exe /f", intern = F, ignore.stdout = F)
# Clean up:
# - Franchises may have the same name with different addresses
# - Different practices may have the same zipcodes and even buildings
# - We will consider each full address unique, and a single practice
# Clean up duplicate addresses
df %>%
.[!duplicated(tolower(.$add), fromLast = T),]
}
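The de-duplication rule at the end of the function keeps only the last occurrence of each address, compared case-insensitively. A minimal sketch with made-up addresses:

```r
# Toy addresses (assumed) illustrating duplicated(..., fromLast = TRUE):
add <- c("1 Main St", "1 MAIN ST", "2 Side Rd")
# Keep the LAST occurrence of each address, ignoring case
add[!duplicated(tolower(add), fromLast = TRUE)]
# [1] "1 MAIN ST" "2 Side Rd"
```

Scanning from the back, the earlier "1 Main St" is flagged as a duplicate of the later "1 MAIN ST", so the later listing wins.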
zipcodes_to_geocodes <- function(zipcodes) {
#' Get Geo-location from Google Maps
#'
#' @description
#' Attempt to obtain the longitudes, latitudes, and addresses of the given
#' zipcodes using ggmap::geocode().
#'
#' @param zipcodes A vector of zipcodes.
#' @return Geo-location data of the associated zipcodes.
# Prompt user to input API key
ggmap::register_google(key = readline("Please enter Google API key: "))
# Create an (almost) empty tibble to append results
res = zipcodes %>%
# Remove duplicates to minimize number of requests
.[!duplicated(.)] %>%
tibble::as_tibble() %>%
dplyr::rename(zip = value) %>%
dplyr::mutate(lon = NA_real_,
lat = NA_real_,
address = NA_character_)
for (i in 1:nrow(res)) {
result = tryCatch({
ggmap::geocode(res$zip[i], output = "latlona", source = "google")
}, warning = function(w) {
w$message
}, error = function(e) {
NA
})
# If the registered key is invalid, there's no point continuing
if (grepl("The provided API key is invalid", result[1], fixed = T)) {
stop("A valid Google API key is required.")
}
# A useful result will have something, and will have names
if (!is.na(result) && !is.null(names(result))) {
res$lon[i] = result$lon
res$lat[i] = result$lat
res$address[i] = result$address
}
# Announce progress
message(i, " of ", nrow(res), " (",round(i / nrow(res) * 100, 2), "%)")
}
res
}
The following functions must be defined before the import code below can run.
import_moh_weekly <- function(url_or_path) {
#' Weekly Infectious Diseases Bulletin
#'
#' @description
#' Weekly infectious diseases bulletin from the Ministry of Health (MOH).
#' Data from 2012-W01 to 2020-W29 (as of 24 July 2020). Relevant links are
#' given under details.
#'
#' @details
#' \href{https://www.moh.gov.sg/resources-statistics/infectious-disease-statistics/2020/weekly-infectious-diseases-bulletin}{MOH Weekly Infectious Disease Bulletin}
#'
#' \href{https://www.moh.gov.sg/docs/librariesprovider5/diseases-updates/weekly-infectious-disease-bulletin-year-202071e221d63d4b4be0aa2b03e9c5e78ac2.xlsx}{Latest data as of 24 July 2020 (2012-W01 to 2020-W29)}
#'
#' \href{https://raw.githubusercontent.com/roscoelai/dasr2020capstone/master/data/moh/weekly-infectious-disease-bulletin-year-202071e221d63d4b4be0aa2b03e9c5e78ac2.xlsx}{Backup copy as of 24 July 2020 (2012-W01 to 2020-W29)}
#'
#' @param url_or_path The URL or file path of the .xlsx file.
#' @return Weekly infectious diseases bulletin (2012-W01 to 2020-W29).
# Columns will be renamed to follow 2020
colnames_2020 = c(
"Campylobacter enterosis" = "Campylobacter enteritis",
"Campylobacterenterosis" = "Campylobacter enteritis",
"Campylobacteriosis" = "Campylobacter enteritis",
"Chikungunya Fever" = "Chikungunya",
"Dengue Haemorrhagic Fever" = "DHF",
"Dengue Fever" = "Dengue",
"Hand, Foot and Mouth Disease" = "HFMD",
"Hand, Foot Mouth Disease" = "HFMD",
"Nipah virus infection" = "Nipah",
"Viral Hepatitis A" = "Acute Viral Hepatitis A",
"Viral Hepatitis E" = "Acute Viral Hepatitis E",
"Zika Virus Infection" = "Zika",
"Zika virus infection" = "Zika"
)
# Check if the given path is a URL by trying to download to a temp file. If
# successful, return the temp file. If not, return the original path.
xlsx_file = tryCatch({
temp = tempfile(fileext = ".xlsx")
download.file(url_or_path, destfile = temp, mode = "wb")
temp
}, error = function(e) {
url_or_path
})
xlsx_file %>%
readxl::excel_sheets() %>%
lapply(function(sheetname) {
df = readxl::read_xlsx(xlsx_file, sheetname, skip = 1)
# Date formats are different for 2020
if (sheetname == "2020") {
df$Start = lubridate::dmy(df$Start)
df$End = lubridate::dmy(df$End)
}
# Find and rename columns that need to be renamed
mapper = na.omit(colnames_2020[names(df)])
dplyr::rename_with(df, ~mapper, names(mapper))
}) %>%
dplyr::bind_rows() %>%
dplyr::rename(Epiweek = `Epidemiology Wk`) %>%
dplyr::mutate(Epiyear = lubridate::epiyear(Start)) %>%
dplyr::select(Epiyear, everything()) %>%
dplyr::arrange(Start)
}
read_kmls <- function(url_or_path) {
# There are (at least) 2 approaches to handling .kml data:
# 1. sp - rgdal::readOGR()
# 2. sf - sf::st_read()
#
# The sp approach was quickly abandoned as their objects were rather complex
# and did not facilitate method chaining.
#
# The sf approach produced objects that look like data.frames, which aided
# method chaining, but had their own peculiarities:
# - Dimension: set to XY using sf::st_zm()
# - Geographic CRS: use sf::`st_crs<-`("WGS84") World Geodetic Survey 1984
# Check if the given paths are URLs by trying to download to temp files. If
# successful, return the temp files. If not, return the original paths.
kml_files = tryCatch({
temp = tempfile(fileext = rep(".kml", length(url_or_path)))
Map(function(x, y) download.file(x, y, mode = "wb"), url_or_path, temp)
temp
}, error = function(e) {
url_or_path
})
kml_files %>%
lapply(sf::st_read, quiet = T) %>%
dplyr::bind_rows() %>%
tibble::as_tibble() %>%
sf::st_as_sf() %>%
sf::st_zm()
}
The following code chunk has been disabled, but is provided here to show how the data may be obtained from primary sources (except for one). Using this code chunk would require a valid Google Maps Platform API key.
grand_import_de_novo <- function() {
# This might take a while, and requires a Google Maps Platform API key
list(
"moh_bulletin" = import_moh_weekly(paste0(
"https://www.moh.gov.sg/docs/librariesprovider5/diseases-updates/",
"weekly-infectious-disease-bulletin-year-2020",
"71e221d63d4b4be0aa2b03e9c5e78ac2.xlsx"
)),
"mss_19stations" = import_mss_daily(years = 2012:2020),
"hci_clinics" = import_hcidirectory() %>%
dplyr::mutate(zip = sub(".*((?i)singapore\\s+\\d+).*", "\\1", add)) %>%
# Requires Google Maps Platform API key
dplyr::left_join(zipcodes_to_geocodes(.$zip), by = "zip") %>%
dplyr::select(-id, -zip, -address) %>%
tidyr::drop_na(),
"planning_areas" = read_kmls(paste0(
"https://geo.data.gov.sg/plan-bdy-dwelling-type-2017/2017/09/27/kml/",
"plan-bdy-dwelling-type-2017.kml"
)),
"dengue_polys" = read_kmls(
c("central", "northeast", "southeast", "southwest") %>%
paste0("https://geo.data.gov.sg/denguecase-", .,
"-area/2020/07/17/kml/denguecase-", ., "-area.kml")
),
"aedes_polys" = read_kmls(c(
c("central", "northeast", "northwest") %>%
paste0("https://geo.data.gov.sg/breedinghabitat-", .,
"-area/2020/07/17/kml/breedinghabitat-", ., "-area.kml"),
c("southeast", "southwest") %>%
paste0("https://geo.data.gov.sg/breedinghabitat-", .,
"-area/2020/07/23/kml/breedinghabitat-", ., "-area.kml")
)),
"mss_63station_pos" = readr::read_csv(paste0(
"https://raw.githubusercontent.com/roscoelai/dasr2020capstone/master/",
"data/mss/Station_Records.csv"
))
)
}
# raw_data <- grand_import_de_novo()
If the data files are available locally, set from_online_repo = F for faster loading. Otherwise, the files will be obtained from an online repository. The .csv files will be read directly, while the other file formats will be downloaded to a temporary folder and read locally.
grand_import_no_webscraping <- function(from_online_repo = TRUE) {
# Allow user to choose whether to import raw data from an online repository
# or from local files.
if (from_online_repo) {
fld = paste0("https://raw.githubusercontent.com/roscoelai/",
"dasr2020capstone/master/data/")
} else {
# Check that the "../data/" folder exists
assertthat::assert_that(dir.exists("../data/"),
msg = 'Unable to locate "../data/" directory.')
fld = "../data/"
}
list(
"moh_bulletin" = import_moh_weekly(paste0(
fld, "moh/weekly-infectious-disease-bulletin-year-2020.xlsx"
)),
"mss_19stations" = readr::read_csv(paste0(
fld, "mss/mss_daily_2012_2020_19stations_20200728.csv"
)),
"hci_clinics" = readr::read_csv(paste0(
fld, "hcid/hci_clinics_20200725.csv"
)),
"planning_areas" = read_kmls(paste0(
fld, "kmls/plan-bdy-dwelling-type-2017.kml"
)),
"dengue_polys" = read_kmls(paste0(
fld, "kmls/denguecase-", c("central",
"northeast",
"southeast",
"southwest"), "-area.kml"
)),
"aedes_polys" = read_kmls(paste0(
fld, "kmls/breedinghabitat-", c("central",
"northeast",
"northwest",
"southeast",
"southwest"), "-area.kml"
)),
"mss_63station_pos" = readr::read_csv(paste0(
fld, "mss/Station_Records.csv"
))
)
}
raw_data <- grand_import_no_webscraping(from_online_repo = F)
names(raw_data)
[1] "moh_bulletin" "mss_19stations" "hci_clinics"
[4] "planning_areas" "dengue_polys" "aedes_polys"
[7] "mss_63station_pos"
# Too verbose
# raw_data %>%
# purrr::walk(dplyr::glimpse)
The tidying process can (ironically) be tidied: it takes too long to get to data_space, the flow of logic deserves a relook, and there are too many reassignments.
We build data_time first.
# Minimize size of datasets, choose "easier" column names
data_time <- raw_data$moh_bulletin %>%
dplyr::select(Epiyear:End,
Dengue,
HFMD,
`Acute Upper Respiratory Tract infections`,
`Acute Diarrhoea`) %>%
dplyr::rename(URTI = `Acute Upper Respiratory Tract infections`,
Diarrhoea = `Acute Diarrhoea`) %>%
tidyr::drop_na() %>%
dplyr::left_join(
raw_data$mss_19stations %>%
# Aggregation schema (up to 19 stations x up to 7 days -> 1 value)
dplyr::group_by(Epiyear, Epiweek) %>%
dplyr::summarise(Mean_rainfall = mean(Rainfall, na.rm = T),
Med_rainfall = median(Rainfall, na.rm = T),
Mean_temp = mean(Mean_temp, na.rm = T),
Med_temp = median(Mean_temp, na.rm = T),
Mean_temp_rng = mean(Temp_range, na.rm = T),
Med_temp_rng = median(Temp_range, na.rm = T),
.groups = "drop"),
by = c("Epiyear", "Epiweek")
)
data_space requires a lot more wrangling.
data <- list(
"dengue_points" = raw_data$dengue_polys %>%
dplyr::transmute(n = sub(".*: (\\d+).*", "\\1", Description)) %>%
sf::st_centroid() %>%
.[rep(1:nrow(.), as.numeric(.$n)),],
"aedes_points" = raw_data$aedes_polys %>%
dplyr::transmute(n = sub(".*: (\\d+).*", "\\1", Description)) %>%
sf::st_centroid() %>%
.[rep(1:nrow(.), as.numeric(.$n)),],
"clinic_points" = raw_data$hci_clinics %>%
sf::st_as_sf(coords = c("lon", "lat")) %>%
sf::`st_crs<-`("WGS84"),
"weather_points" = raw_data$mss_19stations %>%
dplyr::filter(Epiyear == 2020) %>%
# Filter for the last 3 weeks
dplyr::filter(Epiweek > max(Epiweek) - 3) %>%
# Aggregation schema (up to 7 days x up to 3 weeks -> 1 value)
dplyr::group_by(Station) %>%
dplyr::summarise(mean_rainfall = mean(Rainfall, na.rm = T),
med_rainfall = median(Rainfall, na.rm = T),
mean_temp = mean(Mean_temp, na.rm = T),
med_temp = median(Mean_temp, na.rm = T),
mean_temp_rng = mean(Temp_range, na.rm = T),
med_temp_rng = median(Temp_range, na.rm = T),
.groups = "drop") %>%
tidyr::drop_na() %>%
dplyr::left_join(
raw_data$mss_63station_pos %>%
sf::st_as_sf(coords = c("Long. (E)", "Lat.(N)")) %>%
sf::`st_crs<-`("WGS84") %>%
dplyr::select(Station),
by = "Station"
) %>%
sf::st_as_sf(),
"planning_areas" = raw_data$planning_areas %>%
dplyr::bind_cols(
# Extract data from HTML in the Description column (dwelling types)
.$Description %>%
lapply(function(x) {
xml2::read_html(x) %>%
rvest::html_node("table") %>%
rvest::html_table() %>%
t() %>%
`colnames<-`(.[1,]) %>%
.[2, 1:10]
}) %>%
dplyr::bind_rows()
) %>%
dplyr::select(-Name, -Description) %>%
dplyr::rename_all(tolower) %>%
dplyr::rename(plan_area = pln_area_n,
pop = total) %>%
dplyr::mutate(plan_area = tools::toTitleCase(tolower(plan_area)),
dplyr::across(pop:others, as.numeric),
area_km2 = units::set_units(sf::st_area(.), km^2)) %>%
tibble::as_tibble() %>%
sf::st_as_sf()
)
idw_interpolation <- function(points, polys, ordinal = 2) {
# Inverse-distance-weighted interpolation
weights = polys %>%
sf::st_centroid() %>%
sf::st_distance(points) %>%
# Small ordinal: Unweighted average
# Large ordinal: Proximity (Thiessen) interpolation
{ 1 / (. ^ ordinal) }
values = points %>%
as.data.frame() %>%
dplyr::select(-Station, -geometry) %>%
as.matrix()
(weights %*% values / rowSums(weights)) %>%
tibble::as_tibble()
}
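The weighting scheme can be illustrated without any spatial objects. A toy sketch with assumed distances and readings:

```r
# 3 stations at (assumed) distances of 1, 2, and 4 km from a polygon centroid,
# reporting mean temperatures of 30, 25, and 20 degrees C
d <- c(1, 2, 4)
v <- c(30, 25, 20)
w <- 1 / d^2                 # ordinal = 2 (the default above)
sum(w * v) / sum(w)          # IDW estimate at the centroid: ~28.57
```

Raising the ordinal pulls the estimate toward the nearest station; an ordinal of 0 gives the plain unweighted mean (25).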
# TODO: Can we avoid reassignment?
data$planning_areas <- data$planning_areas %>%
dplyr::bind_cols(idw_interpolation(data$weather_points, .)) %>%
tibble::as_tibble() %>%
sf::st_as_sf()
npts_in_polys <- function(points, polys, colname = "n") {
sf::st_intersects(points, polys) %>%
tibble::as_tibble() %>%
dplyr::mutate(plan_area = polys$plan_area[col.id]) %>%
dplyr::count(plan_area, name = colname)
}
data_space <- data %>%
{
list(
npts_in_polys(.$dengue_points, .$planning_areas, "ncases"),
npts_in_polys(.$aedes_points, .$planning_areas, "nhabs"),
npts_in_polys(.$clinic_points, .$planning_areas, "nclinics"),
.$planning_areas
)
} %>%
Reduce(function(x, y) dplyr::left_join(x, y, by = "plan_area"), .) %>%
dplyr::mutate(
popden = pop / area_km2,
caseden = as.numeric(ncases / area_km2),
label = paste0(
"<p><b>", plan_area,
"</b><br/>Area: ", round(area_km2, 2),
"<br/>Cases: ", ncases,
"<br/>Clinics: ", nclinics,
"<br/>Population: ", pop,
"<br/><i>Aedes</i> habitats: ", nhabs,
"</p>"
)
) %>%
tibble::as_tibble() %>%
sf::st_as_sf()
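The Reduce() step above folds a list of per-planning-area counts into one table via repeated left joins. A minimal sketch of that pattern with toy tables (the values are assumed):

```r
library(dplyr)
a <- tibble::tibble(plan_area = c("Bedok", "Yishun"), ncases   = c(10, 20))
b <- tibble::tibble(plan_area = c("Bedok", "Yishun"), nhabs    = c(1, 3))
c <- tibble::tibble(plan_area = c("Bedok", "Yishun"), nclinics = c(5, 7))
# Fold the list into a single table, one row per planning area
Reduce(function(x, y) left_join(x, y, by = "plan_area"), list(a, b, c))
```

Each pass joins the next table onto the accumulated result, so the output has one row per planning area and one column per count.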
We’ve finally got data_time and data_space!
data_time contains the number of cases per epidemiological week (Y) for dengue and a few other diseases from 2012-W01 to 2020-W29, as well as the aggregated weekly mean and median rainfall, temperature, and temperature range (X) from 2012-W01 to 2020-W27. That leaves us with 2 weeks of missing weather values :sob:
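The 2-week gap arises because the bulletin runs to 2020-W29 while the weather records stop at 2020-W27, so the left join leaves trailing NAs. A self-contained sketch of the effect (toy values, joining on week only):

```r
library(dplyr)
cases   <- tibble::tibble(Epiweek = 1:29, Dengue    = 100 + 1:29)  # to W29
weather <- tibble::tibble(Epiweek = 1:27, Mean_temp = 28)          # to W27
joined  <- left_join(cases, weather, by = "Epiweek")
sum(is.na(joined$Mean_temp))  # 2 weeks without weather data
```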
data_space contains various information for each planning area (URA MP14), limited to the planning areas with any data on dengue case numbers. Notably, the north-west region is missing as no data were available. Information includes: dengue cases (Y), number of Aedes habitats, number of clinics, population, area, meteorological variables, numbers of each dwelling type, etc.
data_time %>%
dplyr::select(-matches("Epiweek|End")) %>%
dplyr::mutate(Epiyear = as.factor(Epiyear)) %>%
tidyr::pivot_longer(Dengue:Med_temp_rng) %>%
ggplot(aes(x = Start, y = value, color = Epiyear)) +
geom_line() +
geom_point(alpha = 0.3) +
facet_grid(name ~ ., scales = "free_y") +
labs(title = "Weekly history of variables from 2012 to 2020",
x = "",
y = "",
caption = "Sources: moh.gov.sg, weather.gov.sg") +
theme(legend.position = "none")
data_time %>%
dplyr::select(-matches("Epi|End|Start")) %>%
# Try some transformations
# dplyr::mutate(Dengue = Dengue - mean(Dengue)) %>%
dplyr::mutate(Dengue_sqrt = sqrt(Dengue),
Dengue_log = log10(Dengue)) %>%
tidyr::pivot_longer(everything()) %>%
ggplot(aes(x = value, color = name)) +
geom_density(size = 1) +
facet_wrap(name ~ ., scales = "free") +
labs(x = "",
caption = "Sources: moh.gov.sg, weather.gov.sg") +
theme(legend.position = "none")
data_time %>%
dplyr::select(Epiyear, Dengue, Mean_rainfall, Med_temp, Med_temp_rng) %>%
dplyr::mutate(Dengue = dplyr::lag(Dengue, 1),
Epiyear = as.factor(Epiyear)) %>%
tidyr::pivot_longer(Mean_rainfall:Med_temp_rng) %>%
ggplot(aes(x = value, y = Dengue, color = name)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", formula = y ~ x, size = 0.5) +
facet_grid(Epiyear ~ name, scales = "free") +
labs(x = "",
caption = "Sources: moh.gov.sg, weather.gov.sg") +
theme(legend.position = "none")
data_time %>%
dplyr::select(-Epiyear:-End) %>%
names() %>%
lapply(function(x) {
data_time[[x]] %>%
ts(start = 2012, frequency = 52) %>%
decompose(type = "additive") %>%
autoplot() +
labs(title = x)
})
(Output: ten additive time-series decomposition plots, one per variable.)
data_space %>%
leaflet::leaflet(width = "100%") %>%
leaflet::addTiles() %>%
leaflet::addPolygons(
weight = 1,
opacity = 1,
fillOpacity = 0.6,
smoothFactor = 0.5,
fillColor = ~leaflet::colorNumeric("Reds", caseden)(caseden),
label = ~lapply(label, htmltools::HTML),
popup = ~lapply(label, htmltools::HTML)
) %>%
leaflet::addCircleMarkers(
data = data$dengue_points,
color = "red",
radius = 5,
fillOpacity = 0.5,
clusterOptions = leaflet::markerClusterOptions()
) %>%
leaflet::addLabelOnlyMarkers(
data = sf::st_centroid(data_space),
label = ~plan_area,
labelOptions = leaflet::labelOptions(
noHide = T,
textOnly = T,
direction = "center",
style = list("color" = "blue"))
)
names(data_space)
[1] "plan_area" "ncases" "nhabs"
[4] "nclinics" "pop" "hdb"
[7] "one_to_two_rm" "three_rm" "four_rm"
[10] "five_rm_exec_flats" "condos_other_apts" "landed_properties"
[13] "others" "area_km2" "mean_rainfall"
[16] "med_rainfall" "mean_temp" "med_temp"
[19] "mean_temp_rng" "med_temp_rng" "geometry"
[22] "popden" "caseden" "label"
data_space %>%
dplyr::select(ncases:pop, mean_rainfall:med_temp_rng, geometry) %>%
gridExtra::grid.arrange(grobs = lapply(list(
ggplot(aes(fill = ncases), data = .),
ggplot(aes(fill = nhabs), data = .),
ggplot(aes(fill = nclinics), data = .),
ggplot(aes(fill = pop), data = .),
ggplot(aes(fill = mean_rainfall), data = .),
ggplot(aes(fill = med_rainfall), data = .),
ggplot(aes(fill = mean_temp), data = .),
ggplot(aes(fill = med_temp), data = .),
ggplot(aes(fill = mean_temp_rng), data = .),
ggplot(aes(fill = med_temp_rng), data = .)
), function(x) {
x +
geom_sf() +
scale_fill_viridis_c(guide = guide_colourbar(
title.position = "top",
title.hjust = .5,
barwidth = 10,
barheight = .5
)) +
theme_void() +
theme(legend.position = "bottom")
}), ncol = 2)
data_space %>%
dplyr::select(ncases:pop, area_km2, geometry) %>%
dplyr::mutate_at(dplyr::vars(-geometry), ~(. / area_km2)) %>%
dplyr::select(-area_km2) %>%
dplyr::mutate_at(dplyr::vars(-geometry), as.numeric) %>%
gridExtra::grid.arrange(grobs = lapply(list(
ggplot(aes(fill = ncases), data = .),
ggplot(aes(fill = nhabs), data = .),
ggplot(aes(fill = nclinics), data = .),
ggplot(aes(fill = pop), data = .)
), function(x) {
x +
geom_sf() +
scale_fill_viridis_c(guide = guide_colourbar(
title.position = "top",
title.hjust = .5,
barwidth = 10,
barheight = .5
)) +
theme_void() +
theme(legend.position = "bottom")
}), ncol = 2)
In preparation for modelling, we computed a correlation matrix to see how our variables of interest are related.
data_time %>%
# dplyr::select(Dengue:Diarrhoea, Mean_rainfall, Med_temp, Med_temp_rng) %>%
dplyr::select(Dengue, Mean_rainfall, Med_temp, Med_temp_rng) %>%
as.matrix() %>%
Hmisc::rcorr() %>%
broom::tidy() %>%
dplyr::filter(p.value < 0.05) %>%
dplyr::arrange(estimate)
# A tibble: 3 x 5
column1 column2 estimate n p.value
<chr> <chr> <dbl> <int> <dbl>
1 Med_temp Mean_rainfall -0.499 444 0
2 Med_temp_rng Dengue -0.151 444 0.00142
3 Med_temp Dengue 0.201 444 0.0000193
We performed mean-centering transformations on all the variables that would be turned into interaction terms.
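Mean-centering matters here because a raw predictor tends to be strongly correlated with any interaction term built from it; centering removes most of that collinearity. A self-contained illustration (simulated values, not our data):

```r
set.seed(1)
x <- rnorm(200, mean = 30)   # a temperature-like predictor
z <- rnorm(200, mean = 30)   # another predictor on a similar scale
cor(x, x * z)                # raw: main effect correlates with interaction
xc <- x - mean(x)
zc <- z - mean(z)
cor(xc, xc * zc)             # centered: correlation close to zero
```

The regression coefficients on the interaction are unchanged by centering; only the collinearity (and the interpretation of the main effects) improves.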
data_time %>%
dplyr::select(Dengue, Med_temp, Med_temp_rng) %>%
dplyr::mutate_all(~(. - mean(., na.rm = T))) %>%
# scale() %>%
tibble::as_tibble() %>%
dplyr::mutate(Dengue_sqrt = sqrt(Dengue),
Dengue_log = log10(Dengue),
Dengue_inv = (1 / Dengue)) %>%
tidyr::pivot_longer(everything()) %>%
ggplot(aes(x = value, color = name)) +
geom_density(size = 1) +
facet_wrap(name ~ ., scales = "free") +
theme(legend.position = "none")
data_time %>%
dplyr::select(Dengue, Med_temp, Med_temp_rng) %>%
dplyr::mutate_all(~(. - mean(., na.rm = T))) %>%
# scale() %>%
tibble::as_tibble() %>%
tidyr::pivot_longer(-Dengue) %>%
ggplot(aes(x = value, y = Dengue, color = name)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", formula = y ~ x, color = "blue") +
facet_grid(. ~ name, scales = "free") +
theme(legend.position = "none")
lm(sqrt(Dengue) ~ Med_temp + Med_temp_rng,
data = data_time %>%
dplyr::select(Dengue, Med_temp, Med_temp_rng) %>%
# dplyr::mutate_all(~(. - mean(., na.rm = T))) %>%
# scale() %>%
tibble::as_tibble()) %>%
# summary()
gvlma::gvlma() %>%
autoplot()
data_time %>%
dplyr::select(Epiyear, Dengue, Mean_rainfall, Med_temp, Med_temp_rng) %>%
dplyr::mutate(Dengue_w1 = Dengue,
Dengue = dplyr::lag(Dengue, 1),
Epiyear = as.factor(Epiyear)) %>%
# dplyr::select(-Epiyear) %>%
# as.matrix() %>%
# Hmisc::rcorr() %>%
# broom::tidy() %>%
# dplyr::filter(p.value < 0.05) %>%
# dplyr::arrange(estimate)
# dplyr::select(matches("Epi|Dengue")) %>%
# tidyr::pivot_longer(Dengue_w1) %>%
# ggplot(aes(x = value, y = Dengue, color = name)) +
# geom_point(alpha = 0.4) +
# geom_smooth(method = "lm", formula = y ~ x) +
# facet_grid(name ~ Epiyear, scales = "free") +
# theme(legend.position = "none")
lm(Dengue ~ Dengue_w1 * (Med_temp + Med_temp_rng), .) %>%
# gvlma::gvlma()
autoplot()
# summary()
We ran two regression models. The first regressed demographics and social comparison orientations onto life satisfaction (model1). \[
\begin{eqnarray}
\widehat{swl} = intercept + b_1fem + b_2ageyear + b_3educ + b_4income + b_5sca + b_6scb
\end{eqnarray}
\] Our key investigation lies in the next model, in which we regressed demographics and social comparison orientations, along with interaction terms, onto life satisfaction (model2). \[
\begin{eqnarray}
\widehat{swl} = intercept + b_1fem + b_2ageyear + b_3educ + b_4income + b_5sca + b_6scb \\
+ b_7fem \times sca + b_8ageyear \times sca + b_9educ \times sca + b_{10}income \times sca
\end{eqnarray}
\]
# model1 <- lm(swl ~ fem + ageyear + educ + income + sca + scb,
# data2)
#
# tidy(model1) %>% as_tibble()
# glance(model1)
# model2 <- lm(swl ~ (fem + ageyear + educ + income) *sca + scb,
# data2)
#
# tidy(model2) %>% as_tibble()
# glance(model2)
We tested whether model2, with its interaction terms, improves the explanatory power of the model, using the anova function.
# anova(model1, model2)
The results of the analysis suggest that adding the interaction terms significantly improves model fit: model2 explains significantly more variance (a higher R-squared) than model1.
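To make the comparison concrete, here is a minimal sketch (using simulated data, since the project dataset is not reproduced here) of what `anova(model1, model2)` computes for nested models: an incremental F-test based on the drop in residual sum of squares.

```r
# Simulated stand-in data; the real analysis uses data2 with swl, fem, etc.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.3 * x1 * x2 + rnorm(n)

m1 <- lm(y ~ x1 + x2)   # main effects only (analogous to model1)
m2 <- lm(y ~ x1 * x2)   # adds the interaction term (analogous to model2)

# Manual incremental F: (drop in RSS / extra df) / (RSS_full / df_full)
rss1     <- sum(resid(m1)^2)
rss2     <- sum(resid(m2)^2)
df_extra <- m1$df.residual - m2$df.residual
F_manual <- ((rss1 - rss2) / df_extra) / (rss2 / m2$df.residual)

F_anova <- anova(m1, m2)$F[2]
all.equal(F_manual, F_anova)  # TRUE: anova() performs exactly this test
```

A significant F here means the extra interaction terms reduce the unexplained variance by more than chance alone would, which is what licenses the R-squared comparison above.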
Prof. Roh’s Note: "Here, please check the linearity assumption, using the Global Validation of Linear Models Assumptions (gvlma) package. You may visualize the core information of the assumption checks using the ggfortify package."
# library(gvlma)
# gvlma(model2)
#
# library(ggfortify)
# autoplot(gvlma(model2))
# library(car)
# vif(model1);vif(model2)
Prof. Roh’s Note: “Now that the assumption check is done, you might want to put the results into a prettier table format. The default print-out of a table in R Markdown does not look good. The knitr package contains a very basic command, kable, which will format an array or data frame more presentably for display. Thus, use the following for your report.”
# library(knitr) # Please install the package "knitr" first.
# library(kableExtra) # You might want to use package "kableExtra" as well.
#
# kable(tidy(model2))%>%
# kable_styling("striped", full_width = F) %>%
# column_spec(c(1, 5), bold = T) %>%
# row_spec(c(2, 4, 6, 8, 10), bold = T, color = "white", background = "red")
#
# kable(glance(model2))%>%
# kable_styling("striped", full_width = F) %>%
# column_spec(c(2, 4, 6, 8, 10, 12), bold = T, color = "white", background = "red")
The regression analysis produced two significant interaction terms.
First, the relationship between social comparison orientation regarding abilities and life satisfaction appears to differ depending on one’s gender.
Second, the relationship between social comparison orientation regarding abilities and life satisfaction appears to differ depending on one’s education.
To see the patterns of interaction, we visualize the significant interaction effects in the next chapter.
Unlike gender, which is a categorical variable, education is a continuous variable in our model. To ease interpretation, we categorized it into three levels (1 SD below the mean, the mean, and 1 SD above the mean). We set gender at 0.5 instead of 0, as the ratio of males to females is equal.
# data2 %>% summarize(sd(educ))
#
# grid_group3 <- data2 %>%
# data_grid(sca, educ = c(-1.26, 0.00, 1.26),
# fem = 0.5, ageyear = 0, income = 0, scb = 0) %>%
# add_predictions(model2)
We undid the centering of the variable sca.
# grid <- grid_group3 %>%
# mutate(sca = sca + mean(data$sca), educ = factor(as.double(factor(educ))))
The following figure shows three lines representing the education levels set at M − 1SD, Mean, and M + 1SD, as noted above, and illustrates how differing education levels change the relationship between social comparison orientation and life satisfaction.
# ggplot(grid, aes(x = sca, y = pred, color = factor(educ))) +
# geom_line(size = 2) +
# scale_color_discrete(breaks = c(1, 2, 3),
# label=c("Low in Education",
# "Mean Education",
# "High in Education")) +
# labs(x = "Social Comparison Orientation (Abilities)",
# y = "Life Satisfaction",
# color = "Education") +
# coord_cartesian(ylim = c(2.0, 3.5), xlim = c(0.7, 5.3)) +
# theme_linedraw() +
# theme(legend.position= "top")
Prof. Roh’s Note: “Yes, you might want to perform the analysis above using the jtools, huxtable, ggstance, and interactions packages, as you have learned in class.”
# pacman::p_load(jtools, huxtable, ggstance, interactions)
#
# m1 <- lm(swl ~ fem + ageyear + educ + income + sca + scb,
# data = data)
#
# m2 <- lm(swl ~ (fem + ageyear + educ + income) * sca + scb,
# data = data)
#
# export_summs(m1, m2,
# error_format = "(t = {statistic}, p = {p.value})",
# align = "right",
# model.names = c("Main Effects Only", "Interplay of Social Comparison X Demographics"),
# digits = 3)
#
# plot_summs(m1, m2,
# scale = T,
# plot.distributions = T,
# model.names = c("Main Effects Only", "Interplay of Social Comparison X Demographics")) +
# theme(legend.position = "top")
#
# sim_slopes(m2,
# pred = sca,
# modx = fem,
# johnson_neyman = F)
#
# sim_slopes(m2,
# pred = fem,
# modx = sca,
# johnson_neyman = T)
#
# interact_plot(m2,
# pred = "sca",
# modx = "fem",
# modx.labels = c("Male",
# "Female"),
# interval = T,
# int.width = 0.95,
# colors = c("dodgerblue",
# "darkorange"),
# vary.lty = T,
# line.thickness = 1,
# legend.main = "Gender") +
# geom_vline(xintercept = 3.19, col = "red", linetype = 1, size = 1) +
# labs(title = "The Interplay of Gender and Social Comparison\non Life Satisfaction",
# subtitle = "Among females, the more they compare their abilities with others,\nthere seems to be lesser life satisfaction.",
# x = "Social Comparison of Abilities",
# y = "Life Satisfaction",
# caption = "Source: Your Dataset") +
# annotate("text",
# x = 2,
# y = 2,
# label = "The shaded areas denote 95% confidence intervals.\nThe vertical line marks the boundary\nbetween regions of significance and non-significance\nbased on alpha at 5%") +
# theme(legend.position = "top",
# text = element_text(family = "Courier"))
Our data science project yielded the following two findings:
The relationship between the tendency to compare one’s abilities with others’ and life satisfaction differs depending on one’s gender. Specifically, there appears to be no relationship between social comparison orientation and life satisfaction among males. Among females, on the other hand, the more they compare their abilities with others, the lower their life satisfaction seems to be. Thus, social comparison appears to be harmful to life satisfaction among females, whereas there seems to be no relationship between social comparison and life satisfaction among males. (You might want to highlight, in the Limitations and Future Directions section below, that the relationship between social comparison orientation and life satisfaction comes from an observational study, and therefore reflects correlation, not causation.)
The relationship between social comparison and life satisfaction also depends on one’s education level. Among those with average and high levels of education, the greater the social comparison tendency, the lower the life satisfaction; this negative relationship was not found among those with relatively lower levels of education. In other words, comparing one’s abilities with others’ can hurt life satisfaction for those with average and above-average levels of education (again, acknowledge that the findings are correlational, so further investigation with A/B testing should follow).
Prof. Roh’s Note: “This is where you provide the significance of the findings. Unlike the other sections, where your goal is to describe the results that you found (what the data told you), this is where you chime in and proactively discuss the meaning of the results.”
Prof. Roh’s Note: “Acknowledging limitations is not just providing a laundry list of what is missing and what should have been done. Please take responsibility for your analyses and help your readers understand what the results tell and don’t (or can’t) tell. More importantly, this is the section where you technically begin your next data science project, by highlighting what would be informative down the line to shed further light on what you have found here.”
References
Appendix
Prof. Roh’s Note: “Please describe your individual contribution to the team’s project (in detail).”